Conversation
Wow 👍
util/network-devp2p/src/discovery.rs (outdated)
assert_eq!(data[0], PACKET_PING);
let rlp = Rlp::new(&data[1..]);
assert_eq!(ep2, NodeEndpoint::from_rlp(&rlp.at(1).unwrap()).unwrap());
assert_eq!(ep1, NodeEndpoint::from_rlp(&rlp.at(2).unwrap()).unwrap());
On `Ping` packet received, the node that sent the request is now added to the NodeTable without going through another Ping/Pong round.
Amazing feature, thanks 👍
Cool stuff!
Other general comments: splitting `in_flight_pings` and `in_flight_find_nodes` into separate maps makes a lot of sense to me. The `expiring_{pings,find_nodes}` was a more efficient data structure for checking expirations, but probably an unnecessary optimization given the anticipated size of the maps.
util/network-devp2p/src/discovery.rs (outdated)
@@ -115,6 +115,7 @@ pub struct Discovery<'a> {
id_hash: H256,
secret: Secret,
public_endpoint: NodeEndpoint,
discovering: bool,
Instead of a separate bool, I think making `discovery_round` an `Option<u16>` is cleaner.
Definitely, thanks!
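For illustration, here is a minimal sketch (not the actual Parity code; names assumed from the discussion) of how an `Option<u16>` can replace the separate `discovering` bool while also carrying the round counter:

```rust
struct Discovery {
    // None => no discovery in progress; Some(n) => discovery in progress, at round n.
    discovery_round: Option<u16>,
}

impl Discovery {
    fn start_discovery(&mut self) {
        // Only start if no round is currently running.
        if self.discovery_round.is_none() {
            self.discovery_round = Some(0);
        }
    }

    fn advance_round(&mut self, max_rounds: u16) {
        self.discovery_round = match self.discovery_round {
            Some(round) if round + 1 < max_rounds => Some(round + 1),
            _ => None, // finished (or never started): no discovery in progress
        };
    }
}
```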
});
self.in_flight_find_nodes.retain(|node_id, find_node_request| {
if time.duration_since(find_node_request.sent_at) > FIND_NODE_TIMEOUT {
if !find_node_request.answered {
The rationale for requiring exactly BUCKET_SIZE results is discussed here and matches behavior in Geth. Can you explain why you prefer not to expire if there is an insufficient number of responses?
So the issue was for networks where you don't have more than 15 nodes. The requests would time out, and you would start rejecting every node since they are not well connected enough.
Actually, I don't think it matches Geth's behaviour; from what I've read, it seems that it only errors if you receive 16 or more nodes, nothing regarding expiration.
Anyway, I think it actually makes sense not to expire a request that has been answered, even if it got fewer answers than expected.
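A minimal sketch of the expiry rule being described, with assumed types and timeout value (the real code quoted above differs in details):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Assumed timeout value, for illustration only.
const FIND_NODE_TIMEOUT: Duration = Duration::from_secs(2);

struct FindNodeRequest {
    sent_at: Instant,
    answered: bool, // true as soon as at least one Neighbours packet arrived
}

fn expire_find_nodes(in_flight: &mut HashMap<u64, FindNodeRequest>, now: Instant) {
    in_flight.retain(|_node_id, req| {
        if now.duration_since(req.sent_at) > FIND_NODE_TIMEOUT {
            if !req.answered {
                // Only a request with zero responses counts as a failure
                // (e.g. mark the peer as failed); an answered one is just dropped.
            }
            return false; // stop tracking the request either way
        }
        true
    });
}
```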
So I think that the first bit is checking that the result is actually < 16 (it errors if `nreceived >= bucketSize`); and for the second I'm not 100% sure, but I think it's a general timeout mechanism, not related to FindNode.
Where do you see that it errors? In my reading of it, this code marks the request as complete once `nreceived >= bucketSize` and does nothing (waiting for more responses) if it's not.
Hm I think you are right, sorry I misread the code... Even so, I think the mechanism in this PR is better.
util/network-devp2p/src/discovery.rs (outdated)
// the address from the incoming connection
Some(NodeEntry {
id: *node_id,
endpoint: NodeEndpoint {
Hmm, why use `from` at all? If the ping was sent to `request.node` and we got a response, seems like we should just keep using that.
util/network-devp2p/src/discovery.rs (outdated)
node: NodeEntry,
// The hash sent in the Ping request
echo_hash: H256,
// The hash Parity used to respond with
Can you include in the comment the Parity version that this was fixed in so this can be removed eventually?
I think it was added in #8042 and shipped in 1.11.0. AFAIK geth just ignores this value.
I'm pretty sure Geth checks the hash. See here.
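For context, a rough sketch of the compatibility check being discussed here (field names taken from the diff above; `H256` is from the ethereum-types crate, and the exact hashing details are assumptions):

```rust
use ethereum_types::H256;

struct PingRequest {
    // The hash we put in the Ping packet and expect echoed back in the Pong.
    echo_hash: H256,
    // The hash that older Parity versions echo back instead (they computed it
    // differently), kept so those peers still validate.
    deprecated_echo_hash: H256,
}

// Accept either hash so both new and old peers pass the Pong check.
fn pong_matches(request: &PingRequest, received_echo: &H256) -> bool {
    received_echo == &request.echo_hash || received_echo == &request.deprecated_echo_hash
}
```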
util/network-devp2p/src/discovery.rs (outdated)
let hash = self.send_packet(PACKET_PING, &node.endpoint.udp_address(), &rlp.drain())?;

let request_info = PendingRequest {
packet_id: PACKET_PING,
// Add the request to inflight pings
nit: This comment seems obvious to me from the code below.
util/network-devp2p/src/discovery.rs (outdated)
};
// Here the PONG's `To` field should be the node we are
// sending the request to
ping_to.to_rlp_list(&mut response);
So this line should be deleted and replaced with the one below, right?
Yeah, actually it should be. The thing is that this field should not be used anyway, but old Parity versions (older than this PR, that is) use it in order to get the node's address. So this is a temporary fix so that older Parity versions don't break completely. I'll add a comment, thanks for the heads up.
util/network-devp2p/src/discovery.rs (outdated)
}

Ok(None)
Ok(self.update_node(entry.clone()))
Why the change from `add_node` to `update_node`? According to this, the node should only be added to the routing table after we receive a PONG (not just a PING).
To quote your link: When a ping packet is received, the recipient should reply with a pong packet. It may also consider the sender for addition into the node table.
IMO it's just another unnecessary round of Ping-Pong.
Yes, `add_node` considers the sender for inclusion, but only after obtaining Endpoint Proof, as it is referred to in the doc.
I disagree that it is unnecessary. The purpose of PINGing the node is to ensure that it is listening at the network address it claims to be and participating in the protocol. When receiving a single UDP packet, the origin address can be spoofed, or the packet could be sent without the sender actually listening on the network. The linked document details the difficulty of verifying that a peer is honestly running a discovery server, but PINGing is a good heuristic.
Note that Geth doesn't Ping back on a received Ping, it just adds the sender to its NodeTable.
I don't really know how devp2p works, but I guess that it is useful to wait for the PONG, as nodes behind a NAT (which is probably not uncommon) may report an unreachable address.
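To make the argued-for flow concrete, here is a simplified sketch (assumed types and method names, not the PR's actual code) of adding a node only after the Ping/Pong endpoint proof:

```rust
use std::collections::HashMap;

struct NodeEntry; // stand-in for the real id + endpoint pair

struct Discovery {
    in_flight_pings: HashMap<u64, NodeEntry>,
}

impl Discovery {
    // On an incoming Ping, reply with a Pong and *consider* the sender:
    // ping it back instead of inserting it into the table right away.
    fn on_ping_received(&mut self, sender_id: u64, sender: NodeEntry) {
        self.send_pong(&sender);
        self.send_ping(&sender);
        self.in_flight_pings.insert(sender_id, sender);
    }

    // Only when the Pong comes back (endpoint proof) does the node enter the table.
    fn on_pong_received(&mut self, sender_id: u64) {
        if let Some(entry) = self.in_flight_pings.remove(&sender_id) {
            self.add_to_table(entry);
        }
    }

    fn send_pong(&self, _to: &NodeEntry) { /* send the Pong packet */ }
    fn send_ping(&self, _to: &NodeEntry) { /* send our own Ping */ }
    fn add_to_table(&mut self, _entry: NodeEntry) { /* insert into the k-bucket table */ }
}
```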
let timestamp: u64 = rlp.val_at(3)?;
self.check_timestamp(timestamp)?;

let mut response = RlpStream::new_list(3);
dest.to_rlp_list(&mut response);
let pong_to = NodeEndpoint {
Why is this different from `ping_from`? It makes more sense to me to respond to the endpoint that the sender claims is their address than mixing and matching it with the `from` SocketAddr.
Actually, this isn't really used in the implementations that I've read, but I think the idea is that the recipient of the Pong learns what we think their address is. There is no point in copying a field of the packet they just sent us; they probably know it already.
Well, it's used below as the entry that `add_node` is called on. Whereas I think `ping_from` is a better choice.
Sorry, I thought this was in the `on_pong` method.
It would actually be quite a bad idea to use the `ping_from` that we have been sent, since it's mostly the peer's local address, such as `V4(0.0.0.0:42786)` or `V6([::]:30303)`.
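A small sketch of that point (the `NodeEndpoint` shape here is assumed and simplified): the endpoint we answer to is built from the UDP source address observed at the transport layer, not from the address the peer claims in its Ping.

```rust
use std::net::SocketAddr;

// Simplified stand-in for the real NodeEndpoint in util/network-devp2p.
struct NodeEndpoint {
    address: SocketAddr,
    udp_port: u16,
}

// `from` is the socket address the Ping datagram actually came from,
// which is reachable, unlike a claimed 0.0.0.0 / [::] bind address.
fn pong_endpoint(from: &SocketAddr) -> NodeEndpoint {
    NodeEndpoint {
        address: *from,
        udp_port: from.port(),
    }
}
```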
util/network-devp2p/src/discovery.rs (outdated)
// }
if expected_node.is_none() {
let mut found = false;
self.in_flight_pings.retain(|_, ping_request| {
When would this be the case? Iterating through the whole `in_flight_pings` map on every unexpected PONG doesn't seem like a good idea to me.
If there's a good reason for this, the map might as well be keyed on the ping hash (except that it would be harder to maintain compatibility with the `deprecated_echo_hash` thing).
So this is to cater for the fact that the node answering your Ping might not have the NodeID you've been told. Another mechanism would be to Ping back any unknown received Pong; I can try that if it sounds better.
If the NodeID is wrong, then the address (NodeID + Network Endpoint) is wrong, so I think it's fine to just drop the PONG.
Consider also that the sender of the PONG will likely PING back to us after receiving the PING, from which we can get the correct NodeID, so this unlikely case should resolve itself anyway.
Okay, I'll remove it for now. Note that Geth doesn't Ping back on a received Ping, it just adds the sender to its NodeTable.
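The resulting behaviour, sketched with assumed types (not the literal diff): an unexpected Pong is simply ignored instead of scanning the whole map for it.

```rust
use std::collections::HashMap;

type NodeId = u64; // stand-in for the real 512-bit node id
struct PingRequest; // echo hashes, timestamps, ... omitted

fn on_pong(in_flight_pings: &mut HashMap<NodeId, PingRequest>, node_id: NodeId) {
    match in_flight_pings.remove(&node_id) {
        Some(_request) => {
            // Expected Pong: validate the echo hash and add the node to the table.
        }
        None => {
            // Unexpected Pong: just drop it. The sender will usually Ping us
            // itself, and that exchange yields the correct NodeID anyway.
        }
    }
}
```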
I can confirm the same nice peer-count behavior.
util/network-devp2p/src/discovery.rs (outdated)
@@ -115,13 +128,14 @@ pub struct Discovery<'a> {
id_hash: H256,
secret: Secret,
public_endpoint: NodeEndpoint,
started_discovering: bool,
discovering: bool,
These two bools look like they're doing the same thing? Could we ever be `discovering == true` but `started_discovering == false`?
So `started_discovering` is used in order to start the first ever round of discovery. The idea is that it is set to `false` at the beginning, then we add some peers (bootnodes), ping them, and as soon as the pings are answered or timed out, we start the first discovery round. Then we start a new round every 10s or 30s depending on the number of peers. Without this, we would have to wait an extra 10s for it to run for the first time.
The `discovering` bool is used to prevent starting a new discovery round while there is already one going.
I see, much clearer now. Can't really think of a better name, maybe s/started_discovering/discovery_initiated/. Or just a comment here?
Sounds better, thanks!
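A condensed sketch of how the two flags end up being used (the hook method names here are hypothetical; only the two fields come from the diff):

```rust
struct Discovery {
    // Set once, when the initial bootnode pings resolve; makes the very first
    // round start immediately instead of waiting for the next timer tick.
    discovery_initiated: bool,
    // True only while a round is in progress; prevents overlapping rounds.
    discovering: bool,
}

impl Discovery {
    // Hypothetical hook: a bootnode ping was answered or timed out.
    fn on_ping_resolved(&mut self) {
        if !self.discovery_initiated {
            self.discovery_initiated = true;
            self.start_round();
        }
    }

    // Hypothetical hook: the periodic (10s/30s) refresh timer fired.
    fn on_refresh_timer(&mut self) {
        if !self.discovering {
            self.start_round();
        }
    }

    fn start_round(&mut self) {
        self.discovering = true;
        // ... send FindNode queries; `discovering` is cleared when the round ends.
    }
}
```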
util/network-devp2p/src/discovery.rs (outdated)
trace!(target: "discovery", "Sent Ping to {:?}", &node.endpoint);
trace!(target: "discovery", "Sent Ping to {:?} ; node_id=0x{:x}", &node.endpoint, node.id);
nit: node_id={:#x}
util/network-devp2p/src/discovery.rs (outdated)
}
Entry::Vacant(_) => false,
Entry::Vacant(_) => {
debug!(target: "discovery", "Got unexpected Neighbors from {:?} ; couldn't find node_id=0x{:x}", &from, node_id);
nit: use node_id={:#x}
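For reference, both specifiers produce the same output; `{:#x}` just adds the `0x` prefix via the alternate flag (shown here with a `u64` stand-in for the real node id):

```rust
fn main() {
    let node_id: u64 = 0xdead_beef; // stand-in for the real H256 node id
    assert_eq!(format!("0x{:x}", node_id), "0xdeadbeef");
    assert_eq!(format!("{:#x}", node_id), "0xdeadbeef");
}
```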
util/network-devp2p/src/discovery.rs (outdated)
trace!(target: "discovery", "Got Pong from {:?}", &from);
let dest = NodeEndpoint::from_rlp(&rlp.at(0)?)?;
trace!(target: "discovery", "Got Pong from {:?} ; node_id=0x{:x}", &from, node_id);
let _pong_to = NodeEndpoint::from_rlp(&rlp.at(0)?)?;
This is here just for the rlp-decoding of the subsequent values, right? Maybe just use `_`?
It was actually kept just in case we figure out why this value is useful, as I cannot.
util/network-devp2p/src/discovery.rs (outdated)
@@ -115,13 +128,14 @@ pub struct Discovery<'a> {
id_hash: H256,
secret: Secret,
public_endpoint: NodeEndpoint,
started_discovering: bool,
discovering: bool,
These two booleans look like they're doing the same thing, i.e. could we ever be in a state where `discovering` is `true` and `started_discovering` is `false`?
util/network-devp2p/src/discovery.rs (outdated)
request.response_count += results_count;
true
} else {
debug!(target: "discovery", "Got unexpected Neighbors from {:?} ; oversized packet ({} + {}) node_id=0x{:x}", &from, request.response_count, results_count, node_id);
nit: you can use {:#x} instead of 0x{:x}
This fixes #9549 for me.
trace!(target: "discovery", "Got Pong from {:?}", &from);
let dest = NodeEndpoint::from_rlp(&rlp.at(0)?)?;
trace!(target: "discovery", "Got Pong from {:?} ; node_id={:#x}", &from, node_id);
let _pong_to = NodeEndpoint::from_rlp(&rlp.at(0)?)?;
let echo_hash: H256 = rlp.val_at(1)?;
let timestamp: u64 = rlp.val_at(2)?;
self.check_timestamp(timestamp)?;
Maybe we should check the error here and remove from `in_flight_pings` if there is an error.
It will just expire after a few hundred ms, so I think it's fine as-is.
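For context, the timestamp check being referred to is roughly of this shape (a sketch; the actual `check_timestamp` may differ in details):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

#[derive(Debug)]
struct PacketExpired;

// Discovery packets carry an expiration timestamp (seconds since the Unix
// epoch); anything already in the past is rejected to limit replay.
fn check_timestamp(expiration: u64) -> Result<(), PacketExpired> {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is before the Unix epoch")
        .as_secs();
    if expiration < now {
        Err(PacketExpired)
    } else {
        Ok(())
    }
}
```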
let timestamp: u64 = rlp.val_at(3)?;
self.check_timestamp(timestamp)?;

let mut response = RlpStream::new_list(3);
dest.to_rlp_list(&mut response);
let pong_to = NodeEndpoint {
address: from.clone(),
Here we change our code a bit. Previously (if I am not mistaken), this was done in `on_pong` only if `!node.endpoint.is_valid()`. I think it is better to use the socket address directly as done here (getting the info from the transport layer), but let's keep this change in mind.
I left a few comments; the two more important ones are:
- the change in DISCOVERY_REFRESH (no cleaning if I read the code correctly)
- the initialization of the address on ping from the socket
Other comments are mostly grumbles or open questions.
Check in flight FindNode before pings
Failing tests are unrelated. Anyone else want to give this a final review?
LGTM.
// sending the request to
// WARNING: this field _should not be used_, but old Parity versions
// use it in order to get the node's address.
// So this is a temporary fix so that older Parity versions don't brake completely.
s/brake/break/
LGTM
if request.packet_id == PACKET_FIND_NODE &&
request.response_count + results_count <= BUCKET_SIZE
{
// Mark the request as answered
nit: This comment feels unnecessary to me given the line below.
* Add `target` to Rust traces
* network-devp2p: Don't remove discovery peer in main sync
* network-p2p: Refresh discovery more often
* Update Peer discovery protocol
* Run discovery more often when not enough nodes connected
* Start the first discovery early
* Update fast discovery rate
* Fix tests
* Fix `ping` tests
* Fixing remote Node address ; adding PingPong round
* Fix tests: update new +1 PingPong round
* Increase slow Discovery rate / Check in flight FindNode before pings
* Add `deprecated` to deprecated_echo_hash
* Refactor `discovery_round` branching
It's unbelievable, finally someone was able to find the P2P discovery issues and fix them. Great job!
Hell, yes!
Removes support for `deprecated_echo_hash` ping/pong in devp2p-discovery. It will not work with nodes prior to `stable 2.0.5` (#9526)
This PR fixes a few things in the P2P discovery protocol:
- the `To` field in the `Pong` packet, which is actually useless
- `FindNode` requests that have been answered, but with fewer nodes than expected, are no longer expired (basically it just checks that some packets have been received)
- the hash that old `parity-ethereum` versions used to send back in `Pong` packets

Here are some comparisons between current master and this branch. Look at the number of peers growing.
Running node for 1m30 on `master`: [peer-count screenshot]
Running node for 1m30 on `ng-p2p-discovery`: [peer-count screenshot]
This PR brings to light an issue that I still haven't been able to resolve: when there are many peers connected, from time to time some peers all drop at the same time because of timeouts. It seems that something is holding the timer's thread for heavy processing, and when the thread is free again, many requests have timed out and those peers are removed. If anyone is willing to look into it, that would be great.
fixes #9549